Implementation of the Knuth-Morris-Pratt (KMP) string matching algorithm #403

Merged
czgdp1807 merged 4 commits into codezonediitj:master on Oct 10, 2021

Conversation

Contributor

@sjathin sjathin commented Jun 5, 2021

References to other Issues or PRs or Relevant literature

Fixes #400.
This PR includes an implementation of the KMP string matching algorithm.

Brief description of what is fixed or changed

The Knuth-Morris-Pratt algorithm, also known as KMP, is a string matching algorithm that turns the search string into a finite state machine and then runs that machine with the string to be searched as its input. Execution time is O(m+n), where m is the length of the search string and n is the length of the string to be searched.[1]
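For readers unfamiliar with the algorithm, the description above can be illustrated with a minimal sketch. This is not the code from this PR: the names `build_failure_table` and `kmp_contains` are hypothetical, a plain list is used rather than `OneDimensionalArray`, and treating an empty pattern as a match is an assumption made here.

```python
def build_failure_table(pattern: str) -> list:
    """table[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        # On a mismatch, fall back to the next shorter matching prefix.
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_contains(text: str, pattern: str) -> bool:
    """Return True if pattern occurs in text, scanning text once."""
    if not pattern:
        return True  # assumption: an empty pattern always matches
    table = build_failure_table(pattern)
    k = 0  # current state: number of pattern characters matched
    for ch in text:
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return True
    return False
```

The failure table plays the role of the state machine's fallback transitions: building it takes O(m), and the scan never moves backwards in the text, which gives the O(m+n) total.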

@sjathin sjathin requested a review from czgdp1807 June 5, 2021 20:43

codecov bot commented Jun 5, 2021

Codecov Report

Merging #403 (fff59fe) into master (0dd2c03) will increase coverage by 0.043%.
The diff coverage is 100.000%.

@@              Coverage Diff              @@
##            master      #403       +/-   ##
=============================================
+ Coverage   98.574%   98.618%   +0.043%     
=============================================
  Files           25        26        +1     
  Lines         3297      3401      +104     
=============================================
+ Hits          3250      3354      +104     
  Misses          47        47               
| Impacted Files | Coverage Δ | |
|---|---|---|
| pydatastructs/strings/__init__.py | 100.000% <100.000%> (ø) | |
| pydatastructs/strings/algorithms.py | 100.000% <100.000%> (ø) | |
| pydatastructs/linear_data_structures/__init__.py | 100.000% <0.000%> (ø) | |
| pydatastructs/linear_data_structures/algorithms.py | 99.715% <0.000%> (+0.061%) | ⬆️ |

False

"""
return eval(algorithm + "('" + text + "','" + pattern + "')")
Member

Avoid using eval. Please use a pattern similar to the one shown below:

import pydatastructs.graphs.algorithms as algorithms
func = "_minimum_spanning_tree_" + algorithm + "_" + graph._impl
if not hasattr(algorithms, func):
    raise NotImplementedError(
    "Currently %s algorithm for %s implementation of graphs "
    "isn't implemented for finding minimum spanning trees."
    %(algorithm, graph._impl))
return getattr(algorithms, func)(graph)
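Applied to string matching, the same lookup might look like the sketch below. It is only an illustration under stated assumptions: the `SimpleNamespace` stands in for the module holding the private implementations, the `"_" + algorithm` naming scheme and the error message are hypothetical, and the body of `_knuth_morris_pratt` is a placeholder so the example runs.

```python
from types import SimpleNamespace


def _knuth_morris_pratt(text: str, pattern: str) -> bool:
    # Placeholder body so the sketch runs; the real function
    # would perform the KMP search here.
    return pattern in text


# Stand-in for the module that holds the private implementations
# (hypothetical; a real package would import its own module here).
algorithms = SimpleNamespace(_knuth_morris_pratt=_knuth_morris_pratt)


def find_string(text: str, pattern: str, algorithm: str) -> bool:
    # Resolve the implementation by name instead of building a
    # source string for eval.
    func = "_" + algorithm
    if not hasattr(algorithms, func):
        raise NotImplementedError(
            "Currently %s algorithm for searching strings "
            "isn't implemented." % (algorithm))
    return getattr(algorithms, func)(text, pattern)
```

Besides avoiding arbitrary code execution, this also handles inputs that the string-concatenation approach cannot: a text or pattern containing a single quote would produce invalid source for eval, whereas here the arguments are passed through unchanged.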

return eval(algorithm + "('" + text + "','" + pattern + "')")


def kmp(string: str, substring: str) -> bool:
Member

It would be better to name it _knuth_morris_pratt.

Member

The documentation is not needed here as it would be a non-public function.

'find_string'
]

def find_string(text: str, pattern: str, algorithm: str) -> bool:
Member

The documentation for this should have the list of supported algorithms. For example,

algorithm: str
    The algorithm which should be used for
    computing a minimum spanning tree.
    Currently the following algorithms are
    supported,
    'kruskal' -> Kruskal's algorithm as given in
    [1].
    'prim' -> Prim's algorithm as given in [2].

The full docstring for the above example is as follows:

"""
Computes a minimum spanning tree for the given
graph and algorithm.

Parameters
==========

graph: Graph
    The graph whose minimum spanning tree
    has to be computed.
algorithm: str
    The algorithm which should be used for
    computing a minimum spanning tree.
    Currently the following algorithms are
    supported,
    'kruskal' -> Kruskal's algorithm as given in
    [1].
    'prim' -> Prim's algorithm as given in [2].

Returns
=======

mst: Graph
    A minimum spanning tree using the implementation
    same as the graph provided in the input.

Examples
========

>>> from pydatastructs import Graph, AdjacencyListGraphNode
>>> from pydatastructs import minimum_spanning_tree
>>> u = AdjacencyListGraphNode('u')
>>> v = AdjacencyListGraphNode('v')
>>> G = Graph(u, v)
>>> G.add_edge(u.name, v.name, 3)
>>> mst = minimum_spanning_tree(G, 'kruskal')
>>> u_n = mst.neighbors(u.name)
>>> mst.get_edge(u.name, u_n[0].name).value
3

References
==========

.. [1] https://en.wikipedia.org/wiki/Kruskal%27s_algorithm
.. [2] https://en.wikipedia.org/wiki/Prim%27s_algorithm

Note
====

The concept of minimum spanning tree is valid only for
connected and undirected graphs. So, this function
should be used only for such graphs. Using with other
types of graphs may lead to unwanted results.
"""

Adding a note is optional in a docstring.
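Following that template, a docstring for find_string could be sketched roughly as below. The summary wording, the algorithm key 'kmp', and the reference URL are assumptions chosen for illustration, not the text that was eventually merged; the function body is omitted on purpose.

```python
def find_string(text: str, pattern: str, algorithm: str) -> bool:
    """
    Checks whether the given pattern occurs in the text
    using the requested algorithm.

    Parameters
    ==========

    text: str
        The string to be searched.
    pattern: str
        The string to look for in the text.
    algorithm: str
        The algorithm which should be used for
        searching the pattern.
        Currently the following algorithms are
        supported,
        'kmp' -> Knuth-Morris-Pratt as given in [1].

    Returns
    =======

    bool
        True if the pattern occurs in the text,
        False otherwise.

    References
    ==========

    .. [1] https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
    """
    # Body intentionally omitted; only the docstring layout is illustrated.
    raise NotImplementedError
```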

return patterns


def _doMatch(string: str, substring: str, patterns: OneDimensionalArray) -> bool:
Member

Please follow snake case instead of camel case.

Member

_doMatch -> _do_match. It would be better to define this function inside _knuth_morris_pratt since, for now, it is called only within its scope.

Comment on lines 7 to 26
def _test_common_string_matching(algorithm):
    true_text_pattern_dictionary = {
        "Knuth-Morris-Pratt": "-Morris-",
        "abcabcabcabdabcabdabcabca": "abcabdabcabca",
        "aefcdfaecdaefaefcdaefeaefcdcdeae": "aefcdaefeaefcd",
        "aaaaaaaa": "aaa",
        "fullstringmatch": "fullstringmatch"
    }
    for test_case_key in true_text_pattern_dictionary:
        assert find_string(test_case_key, true_text_pattern_dictionary[test_case_key], algorithm) is True

    false_text_pattern_dictionary = {
        "Knuth-Morris-Pratt": "-Pratt-",
        "abcabcabcabdabcabdabcabca": "qwertyuiopzxcvbnm",
        "aefcdfaecdaefaefcdaefeaefcdcdeae": "cdaefaefe",
        "fullstringmatch": "fullstrinmatch"
    }

    for test_case_key in false_text_pattern_dictionary:
        assert find_string(test_case_key, false_text_pattern_dictionary[test_case_key], algorithm) is False
Member

Nice work on test cases.

return _doMatch(string, substring, patternsInSubString)


def _buildPattern(substring: str) -> OneDimensionalArray:
Member

Same suggestions as in _doMatch.

from . import trie
from . import (
trie,
string_matching_algorithms
Member

Please rename the file from string_matching_algorithms.py to algorithms.py. We will keep all the string-related algorithms in this file.

@czgdp1807
Member

Thanks for the PR. Left some suggestions.

@czgdp1807 czgdp1807 merged commit 7878ee4 into codezonediitj:master Oct 10, 2021
@czgdp1807
Member

Thanks @sjathin for this.

Successfully merging this pull request may close these issues.

Knuth–Morris–Pratt(KMP) algorithm
2 participants